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ABSTRACT 


Cone regression is a particular case of quadratic programming that minimizes a weighted sum of squared 
residuals under a set of linear inequality constraints. Several important statistical problems such as 
isotonic, concave regression or ANOVA under partial orderings, just to name a few, can be considered as 
particular instances of the cone regression problem. Given its relevance in Statistics, this paper aims 
to address the fundamentals of cone regression from a theoretical and practical point of view. Several 
formulations of the cone regression problem are considered and, focusing on the particular case of 
concave regression as example, several algorithms are analyzed and compared both qualitatively and 
quantitatively through numerical simulations. Several improvements to enhance numerical stability and 
bound the computational cost are proposed. For each analyzed algorithm, the pseudo-code and its 
corresponding code in Scilab are provided. The results from this study demonstrate that the choice of the 
optimization approach strongly impacts the numerical performances. It is also shown that methods are 
not currently available to solve efficiently cone regression problems with large dimension (more than many 
thousands of points). We suggest further research to fill this gap by exploiting and adapting classical 
multi-scale strategy to compute an approximate solution. 


Keywords: cone regression, concave regression, convex quadratic programming, linear complemen¬ 

tarity problem, proximal gradient methods 


1 INTRODUCTION 


Cone regression analysis is a valuable alternative to more traditional parametric-regression models, 
in all cases where the functional relationships between the response (dependent) and the explanatory 
(independent) variables is unknown and nonlinear and the constraints are a set of linear inequalities. 
Several important statistical problems including isotonic, concave and constrained spline regression, or 
ANOVA under partial orderings can be seen as particular instances of the more general cone regression 
problem. Cone regression admits several formulation approaches and implementation strategies, whose 
choice severely impacts numerical performances. However, due to the little exposure to the topics of 
optimization theory in modern-day Statistics, many optimization and numerical approaches are commonly 
ignored by statisticians. This paper is a contribution to fill this gap in the literature, addressing the 
fundamentals of cone regression from a theoretical and practical point of view. With the goal of going in 
deep with comparisons and numerical issues, we focus in particular on the concave regression problem. 
In spite of its theoretical simplicity, since the number of constraints increases linearly with the data 
size, concave regression offers a good basis to discuss the fundamentals of cone regression and related 
numerical issues. 

The problem of concave regression is to estimate a regression function subject to concavity constraints 
represented by a set of linear inequalities. Brought to the attention of the scientific community by 
micro-economists interested in estimating production function Hildreth] ([T954]) ; [Dent] ([1973]) ; [Holloway] 
( |1979| ), the problem of concave regression arises not only in the field of micro-economy (indirect utility, 
production or cost functions, Laffer curve) but also in medicine (dose response experiments) and biology 
(growth curves, hazard and failure rate in survival analysis). First addressed by Hildreth in 1954 |Hildreth| 
(1954}, the search for efficient methods for solving large concave regression problems is still an open issue 
nowadays. This may appear quite surprising considering the noticeable advances in convex optimization 
since then, but it can be probably understood when considering that most efforts have been devoted to 
theoretical issues such as generalizations and convergence while comparatively little attention has been 
paid to the issues of efficiency and numerical performances in practice |Perkins| |2003| ); |GouId| ( |2008l ); 
|Censor et al.| ( |2Q09| ). 





























In this paper, we formulate the cone regression problem by different optimization approaches, we 
highlight similarities and difference between the various algorithms passed in review, we propose several 
improvements to enhance stability and to bound the computational cost and we estimate the expected 
performance of available algorithms, establishing in particular which is the most competitive technique 
for solving large instances of the problem. Finally, in the light of this study, we give recommendations for 
further research. 

In section [2j we state formally the problem of cone regression also introducing some basic notations 
and results that will be used thoroughly. In section [3] we survey the state of the art distinguishing 
between the class of algorithms with asymptotic convergence and the class of algorithms with time finite 
convergence. In section [6] we make a numerical comparison of performances and finally, in section [7] we 
draw some concluding remarks. 

2 STATEMENT OF THE PROBLEM, BASIC NOTATIONS AND BASIC FACTS 

The aim of a regression analysis is to produce a reasonable analysis to the unknown response function /, 
which can be modeled as 

y = f(z) + e (l) 

where is the explanatory (dependent) variable, y G 8% d is the response (independent) random 

variable, and £ is an error term, which is usually assumed to be a mean zero random variable. Typically, 
one has observations on y and z for n selected values of z- For each level of input, say zu there may be 
several trials and corresponding observations of output y;. Let 7] be the number of trials at level of input 
Zi and let yu be the observed output for the t —trial at this level. Than we have 

yit=f(zi) + £u, t = (2) 

Inference about the response function may be drawn by assuming that the function f(z) can be 
approximated by some given algebraic form with several unknown parameters to be estimated from the 
data. However, the difficulty with this procedure is that the inferences often depend critically upon the 
algebraic form chosen. Alternatively, one may know some properties of the relation being studied but 
does not have sufficient information to put the relation into any simple parametric form. In this case, a 
nonparametric approach is more appropiated. Let x; be the expected value of output at input level z{. 

Xi = f{zi ) i=l,...,n (3) 

Estimates of x- t can be derived by the method of maximum likelihood, or by the method of least squares or 
other formulations. If there were no a priori restriction on /, than the maximum likelihood estimation of 
xt would just be the mean of observed output for the level of input zu that is 

1 i=l 

Xi = — £}’,> i=l,...,n (4) 

1 i 7 } 

Instead, since in the cone regression problem the known property of the regression function / can be 
expressed by a set of linear inequalities, to obtain the maximum likelihood estimates, the likelihood 
function should be maximized subject to the linear inequality constraints. 

Formally, given a dataset of n dependent variables represented by the vectors w,y G & n , corresponding 
to the independent variable values z\ < zi < ... < the problem of cone regression is to estimate the 
closest function to the dataset via a least squares regression subject to a set of linear inequality constraints 
by solving 

x = argmin||x —yllG (5) 

{*"< 0 } 


n 

with \\x-y\\ 2 tW = Y, w i(yi~Xi) 2 
i= 1 

Denoting by those vectors that satisfy the linear inequality constraints for a fixed i, then ^ 0 
is a closed convex set in 8% n and the feasibility set can be written as the nonempty intersection of 
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2.1 Convex quadratic programming (CQP) formulation 



Figure 1 . The polar cone J(f° of a given convex cone C is given by the set of all vector whose 
scalar product with vectors of is negative. The data point y can be written as the sum of x, the 
projection onto the cone and x°, the projection onto the polar cone Jf°. 


a family of closed subsets C 8% n . Being the intersection of closed convex sets, the set is also a 
closed convex set. More precisely, since each is an half-space which contains the origin, the feasibility 
set is a convex polyhedral cone. In matrix form, can be written as — {x : Ax < 0}. In the case 
of concave regression A E & mxn with m = n — 2 is a matrix such that each row A; represents a concave 
two-piece linear function with a negative second difference at Xi+\ only and the linear inequalities are as 
follows 


x i+2 x i +1 x i +1 x i ^ 1 0 

-< 0. i = 1..... n — 2 

£z+2 £z+l £z+l Zi 


( 6 ) 


In the following, we give alternative formulations of the cone regression problem that rest on opti¬ 
mization theory. 


2.1 Convex quadratic programming (CQP) formulation 

2.1.1 Primal formulation 

The problem is to find the point x in the cone that is closest to y. The solution is found at the 
orthogonal projection of y onto Jf, written as Yl{y\Jff) using the metric || • || 2 jW , represented by the 
symmetric positive definite matrix W. 

x = TL(y\J(f) = argmin(y — x) T W(y — x) (7) 

{A*<0} 

For the problem Q. the matrix W is diagonal with element Wi on the diagonal. In practice, if y/ is the 
mean value measured at zu than w* corresponds to the size of the sample at z\. Since K is a closed, convex 
and non empy set on the Hilbert space & n , it is a set of Chebyshev, that is the projection exists and it is 
unique. 


2.1.2 Dual formulation 

The dual formulation of problem 0 rests on the Moreau decomposition theorem Moreau|( [l962a| ), which 
is a generalization in convex analysis of the orthogonal projection theorem for vectorial sub-spaces. 
Central to the Moreau decomposition theorem is the definition of polar cone of a given convex cone Jff, 
which is given below. 


Definition 1 . The polar cone to any convex cone Jtf is given by 

= {xG & n : V& E {k',x) < 0} 


( 8 ) 
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The Moreau decomposition theorem is as follows. 










2.2 Linear complementarity problem (LCP) formulation 


Theorem 1. Let dff C M n be a closed convex cone, /to polar cone and y G a given point. Then 

the following assertions are equivalent: 

(i) x = argmin||x — y|| 2 , = argmin||x — y\\ 2 

(ii) x G G (x,x°) = 0. y = x + x° 

By relying on this theorem we can alternatively solve problem ([7]) by first finding the projection on 
the polar cone x° and then computing the solution to the primal problem as the difference x = y — x° (see 
Fig. [I]). This alternative is attracting since, as it will be clarified below, it implies an analytically simpler 
form for the constraints. 

Before stating an important Lemma about the relationship between the polar cone and the constraint 
matrix A, let us introduce the definition of edges of a polyhedral convex cone. 

Definition 2. Let be a polyhedral convex cone in & n , then the vectors e\ G 8% n \ {0} are the edges or 
generators of Jff if and only if dff — pos({ei}) = {Y,ki e i\k > 0}. 

Intuitively speaking, the edges of a polyhedral convex cone are one-dimensional rays, which always 
passe through a fixed point (the vertex). 

Lemma 1. The rows of the matrix A are the edges of the polar cone, that is Jif° = {x:x = Y!iL\ A.J a i, a i > 

0 }. 

To see that, observe that = {X'=i a.Af.a, > 0} is polar to dff since 

Vv G G (p,x) = ,x) < 0}, which is the definition of polar cone of J(f. 

Conversely, is polar to J(f° since: J(f m = dff. 

By relying on this results Khun-Tucker |Kuhn and Tucker| ( [195 T] ) proved the following theorem: 

Theorem 2. The primal constrained quadratic minimization problem ([7|) is equivalent to the dual problem 


X = argmin(y — A T X) T W(y — A r X) 
A>0 


(9) 


Denoting by X the solution to the dual problem, the solution to the primal problem is x = y — A T X. 

As it can be observed, in the dual formulation each element of the vector A must satisfy a single 
positivity constraint. 

Goldman [Goldman and Ruud (1993) noticed that dual problem can be also view as a minimum 
distance problem in the same parameter space as the primal problem. 


v = argmin \\x\\ 


( 10 ) 


where ^ = {x\x = y — A r A, A > 0} is a rotation of the dual cone with its vertex translated to y. x also 
solves the re-parametrized dual problem. 


2.2 Linear complementarity problem (LCP) formulation 

The CQP 0 can be recasted as a linear complementarity problem (LCP). To see that, let us consider the 
Lagrangian associated to problem 0. 

L(x,X) = \\x~y\\\ w + < A,Ax> ( 11 ) 


where A > 0 is the vector of dual variables associated to each of the convexity constraints. By applying 
the Karush-Kuhn-Tucker (KKT) optimality conditions Kuhn and Tucker ( 1951 ) to ( pT) , that is 


VL(x,A) =0 

(12) 

A >0 

(13) 

X T Ax = 0 

(14) 
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2.3 Proximal formulation 


we obtain the equivalent LCP 

w + MA=4 (15) 

w > 0, A > 0, w r A = 0. (16) 

where w = —A T x, M = —AA T and q = —A T y. Note that by dropping the constant term from the Lagrangian 
and dividing it by 2: L(jc, A) = — y r v +A r Av. Therefore: VL(x, A) = — y r + A r A = 0. By taking 

the transpose and multiplying for —A: — Ax+ (— AA r )A = —A T y. This LCP has a unique complementary 
solution. Denoting by (w, A) its solution, A is the optimal solution of the dual problem 

The condition w T A = 0 is called complementarity condition and the way in which it is dealt with 
determines if the optimization algorithm belongs to the class of interior point methods that will be 
introduced in section TT. 1.41 or to the class of active set methods that will be detailed in section [T2l 


2.3 Proximal formulation 

The CQP 0 can be solved by using a proximity operator Moreau (1962b 1963| . Proximity operators are 
used to solve problems of the form 


argmin f x (x) + h 0) • • • + f m (x) (17) 

where /i,/ 2 , ...,/ m are convex functions from & n to ] — «>, +«>], that are not necessarily differentiable. 
Each fi is treated through its proximity operator which is defined as follows. 

Definition 3. Let To(& n ) be the class of lower semicontinuous convex functions from & n to ] — °o ? +oo] 
such that their domain, denoted by dom(f) is not the empty set. Let / be a function / G rb^"), then the 
proximity operator of / is proxy (x) : & n 8% n such that 

Mx£& n , proxfiy) = argmin/(x) + Lx-y|| 2 (18) 

xeM n z 


The proximity operator is characterized by the property 

V(x,p) e& n x& n , p = prox f (y ) <*=> y~p€df(p), (19) 

where df : fM n —> 2*^" is the subdifferential of /. 

df= |«€f : Vy e& n : (y- p) T u + f(p) </(y))j (20) 

The proximity operator of a convex function is a generalization of the projection operator onto a closed 
convex set . To see that let us consider the indicator function of ^ 




0 if x e 
00 if x ^ ^ 


By using the fact that minimizing J{x) over ^ is equivalent to minimizing J(x) + i<g(x) over 

argmin J(x) = argmin{/(x) + (x)} 

xix^ n 

it results that prox^ = n(-|^). 

The solution to problem ([7} can be therefore understood as the proximity operator over dXA 

X = argmin||x-y|| 2 = argmin{||x-;y|| 2 + jjf (*)} (21) 

= p wx i^y ( 22 ) 


5 ) 
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Alternatively, using the fact that (J^i D... D +... + the dual problem |7]) can be seen 

as the proximity operator over the sum of m indicator functions of the convex sets 


x° = argmin \\x-y\\ z =aigmin{\\x-y\\ 2 +Yi je .o(x)} 
* 6 IE=i je i° xe ^ n 


i= 1 


p rox E"i 1 i x py 


(23) 

(24) 


Intuitively, the base operation of a proximal algorithm is evaluating the proximal operator of a function, 
which itself involves solving a small convex optimization problem. These subproblems often admit closed 
form solutions or can be solved very quickly with standard or simple specialized methods. 


3 STATE OF THE ART 

In this section, we review existing algorithms for solving the cone regression problem. The algorithmic 
approaches, and in turn their numerical performances, strongly depend on the choice of the problem 
formulation. All existing methods are iterative and they can attain the optimal solution since is a 
Chebyshev set and therefore the optimal solution must exist in this closed set. However, in terms of their 
numerical perfomances they can be classified into two broad classes: the class of methods never or in 
very simple cases attain the optimal solution [Hildreth) ( |1954| ); |Dykstra| ( | 19831 ) and those of methods that 
converge to the optimal solution in a finite number of ste ps |W ilhelmsen[( 1976); Pshenichny and Danilin 


(p978|);jWu 


Fathi 


( | 19 82| ); [Fraser and Massam| ( [T98 9); Meyer (1999 pear); Murty and Fathi (1982); Liu and 


(2011). As it will be clarified in the following, methods with asymptotic convergence rest on the 


properties of the sub-gradient or more in general of proximity operators and act by finding the solution 
as the limit of a sequence of successive approximations. They are typically derived from the primal, 
the dual or from the proximal formulation. Methods with finite-time convergence exploit the geometric 
properties of polyhedral convex cones and find the exact solution as non-negative linear combination of 
functions, forming a basis in a specified finite dimensional space. They are typically derived from the 
linear complementarity problem formulation. 


3.1 Algorithms with asymptotic convergence 

This section includes algorithms based on the primal formulation such as Least Squares in a Product 
Space (section [3. 1.1 [ algorithms based on the dual formulation such as the Uzawa’s method (section [3. 1.2| ) 
and Hildret’s method (section [3.1. 3 | ), algorithms that solve the dual problem simultaneously with the 
primal problem such as the Dykstra’s alternating projection method (section [3X5]) and algorithms based 
on the proximal formulation such as Alternating Direction Method of Multipliers (section 3.1.6| ). 


3.1.1 Least squares in a product space (LSPS) 

Since the Euclidean space & n equipped with the dot product is an Hilbert space the problem 0 can 
be recasted in the m —fold Cartesian product of say Pierra| ( [T984 ). Let be the Cartesian 
product of the sets (^); G /, i.e., the closed convex set = {x G : Vi G / : x t G and let 

D be the diagonal vector subspace, i.e. D = {(x,...,%) G : x G Then, the CQP ([7]) is equivalent 
to 


argmin \\x-y\\\ w (25) 

{xeJtr m nD} 

where y = (y, ...,y). Using this strategy, the problem of projecting onto the intersection of m convex sets 
is reduced to the problem of projecting, in an higher dimensional space, onto only two convex sets, one 
of which is a simple vector subspace. Geometrically, this is equivalent to find a point in D which is at 
minimum distance from This point can be obtained iteratively by 


Xk +1 =Xk + h(PD°P%r( x k)-Xk) 


(26) 


The advantage of this strategy is in that it allows to speed up the convergence since a bigger (than 2, which 
is the upper bound for Fejer sequences Eremin ( 1969| )) relaxation interval can be allowed. 






























































3.1 Algorithms with asymptotic convergence 


3.1.2 Uzawa method 

A classical method to solve a convex minimization problem subject to inequality constraints is the Uzawa 
method |Arrow et al.] ( ]1958| ), which search directly for the saddle point of the Lagrangian (jTTJ). In fact, if the 
Lagrangian L(x,X) admits a saddle point, say (v, A), then the duality gap 8 = min xe j^ max% e &+ L(x, A) — 
m ax^ e ^>+min xe ^L(v, A) is null and x is a critical point of the Lagrangian. Since the dual function 
H( A) = argmin xeJ ^L(x, A) is differentiable, it can be minimized explicitly by using the gradient descent 
method. Therefore the Uzawa method alternates a minimization step over f$ n with respect to v with A 
fixed and a maximization step with respect to A onto ^ + , with v fixed. The algorithmic parameter p > 0 
can be fixed to optimize convergence by relying on theoretical considerations. Therefore the CQP ([9} is 
equivalent to find 

x = argminargmaxL(v,p) = \ \x — y|| 2 + < JU,Ax> (27) 

x /1>0 


3.1.3 Hildreth’s algorithm 

Hildreth|Hildreth|11954J proposed to apply the Gauss Seidel algorithm[Kahan||T958| to the dual problem 
A single cycle of the Hildreth’s algorithm consists in updating each element of A sequentially in an 
arbitrary fixed order. Therefore each cycle consists of m steps, each of which corresponds to a projection 
onto the cone J%,i = 1, ...,m. The algorithm gives rise to a sequence of points, each of one differs from 
the preceding in exactly one coordinate. At the cycle k+ 1, the A^ +1 is used in the estimation of the 
point A^ 1 so that the best available estimations are used for computing each variable. The convergence 
of the Gauss Seidel algorithm is guaranteed only if the matrix A is full row rank, so that there are not 
redundancies among the inequality restrictions, and it is guaranteed independently of the initial point A 0 
only if A is positive definite and symmetric. The algorithm is sensitive to the normalization as well as to 
the order of the projections. 


3.1.4 Primal-dual interior point methods 

First introduced by Karmakar in 1984 Karmarkar ( 1984| , primal-dual interior point methods act by 
perturbing the complementarity condition w T A =0 in the LCP formulation \15j and replacing with 
w r A = /i. The partition of vectors w and A into zero and nonzero elements is gradually revealed as the 
algorithm progresses by forcing a reduction of p. All iterates satisfy the inequality constraints strictly. 
The solution is approached from either the interior or exterior of the feasible region but never lie on the 
boundary of this region. Let the function F(x,X,w) be such that the roots of this function are solutions to 
the first and the last optimality conditions in|T~5| 


Fn(x, X,w) = 


w — A t x 
x T -y T +X'A 
^ w T A — /xe 


The perturbed complementarity condition introduces a nonlinearity, therefore for each fixed p > 0 
a system of nonlinear equations should be solved. The nonlinear system is typically solved by using a 
Newton-like algorithm |Ben-IsraeT| ( [ 1966| . Each iteration of the Newton’s method finds a search direction 
from the current iterate (x^, A k,Sk) and it is computationally expensive but can make significant progress 
towards the solution. For instance in barrier methods, which are the most efficent of the family, this is 
achieved by using a penalizing term, called barrier function, for violations of constraints whose value on 
a point increases to infinity as the point approaches the boundary of the feasible region. Interior point 
methods must be initialized at an interior point, or else the barrier function is undefined. The interested 
reader is referred to Singh and Singh| ( |2002 ) for further information about interior point methods. 


3.1.5 Dykstra’s algorithm 

In 1983, Dykstra |Dykstra| ( | 1983 [ ) proposed a generalization of the Hildreth’s procedure applicable to 
the case of constraints corresponding to more general convex cones than polyhedral convex ones. The 
Dykstra’s algorithm is based on the idea, before suggested by Von Neumann von Neumann] ( |195Q| to 
the case of subspaces, of computing the projection onto the intersection of convex sets by relying on the 
solution of the simpler problem of projecting onto the individual sets. In the case of concave regression 
the projection onto a single convex set involves only three points and, if the constraint is not satisfied, 
it corresponds to the straight line fitting the points y*,y/+i,;y/+2- 


7 , 
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3.1 Algorithms with asymptotic convergence 


Dykstra’s algorithm iterates by passing sequentially over the individual sets and projects onto each 
one a deflected version of the previous iterate. More precisely, before projecting onto the cone 
during the (k+ 1)—th cycle, the residuum obtained when projecting onto Jtj at the previous k —th cycle, 
say R k is removed and a new residuum associated to the cone say R k+l is computed after the 

-... + R k m onto where 


-...+Rf_i+^ 


/+1 


projection. In practice, each x k is the projection of y + R k - 

R\ = ^i-{y+R\+RU+ R \+l + ---+ R m 1 )- 

If each is a subspace, then at each new cycle k + 1 the residuum —R k of the projection onto each 
convex cone is of course the projection of x k onto the cone JF'F. Therefore, the Dykstra’s procedure 
for subspaces reduces to exactly the cyclic, iterated projections of von Neumann. In this case, for k -A °o , 
the sum of the residua over the cones JF i k approximates the projection of y onto the polar cone and 
therefore, for the Moreau decomposition theorem, x k approximates the projection onto Jfr. 

However, if the are not subspaces, n(-| is not a linear operator and then the von Neumann 
algorithm does not necessarily converge. 

The Dykstra’s algorithm can also be interpreted as a variation of the Douglas-Rachford splitting 
method applied to the dual proximal formulation ( [23] ). 

The seminal works of Hildreth and Dykstra have inspired many studies mostly devoted to theoretical 


investigations about their behavior in a Hilbert space Boyle and Dykstra (1986); Varian (1984), about 

its convergence Iusem and Pierro 

1991'); Crombez (1995), about its relation to other methods Gaffke 

and Mathar (1989); Bauschke et al. 

(1994 ) and about its interpretation in more general frameworks, such 

as the proximal splitting methods 

Bauschke et al. (2011). Han Han (1988), as well as Iusem and De 


the Hildreth’s algorithm and therefore it has the same geometric interpretation of Gauss Seidel to the 
dual problem. Gaffke and Mathar (1989) [Gaffke and Mathar|fl989| ) showed the relation of the Dysktra 


algorithm to the method of component-wise cyclic minimization over a product space, also proposing a 
fully simultaneous Dykstra algorithm. The only works devoted to give some insight about a more efficient 
implementation are the ones of Ruud. Goldman and Ruud |Goldman and Ruud| ( [ 1993| )( 1993) generalized 


the method of Hildreth showing that there is not need to restrict the iterations to one element of A at a time: 
one can optimize over subsets or/and change the order in which the elements are taken. This observation 
is important for the speed of convergence since the slow speed can be understood as a symptom of near 
multicollinearity among restrictions. Because the intermediate projections are so close to one another, 
the algorithm makes small incremental steps towards the solution. They also remarked that Dykstra uses 
a parametrization in the primal parameter space, that causes numerical round off errors in the variable 
residuum. These round off errors cumulate so that the fitted value does not satisfy the constraints of the 
dual problem. It would be better to use a parametrization on the dual so that the contraints in the dual 
would be satisfied at each iteration. Later, RuudjRuud ( 1991) proved that the contraction property of the 


proposed generalizations rests solely on the requirement that every constraints appears in at least one 
subproblem of an iteration. As one approach the solution, constraints that are satisfied at the solution are 
eliminated. To remove satisfied constraints would accelerate the Hildreth procedure. The authors propose 
to reduce the set of active constraints, that is the constraints satisfied as an equation at the corresponding 
points, by removing as many constraints as possible through periodic optimization over all positive 
elements of A. 

3.1.6 Alternating Direction Method of Multipliers (ADMM) 

ADMM is an augmented Lagrangian technique Hestenes ( 1969| ; Powell ( 1969) which can be applied to 
problems of the form 


Find argmin | \y - 

ze^ m Ax=z,z< 0 


■x\\ 2 +g(Ax) 


(28) 


where the matrix A is assumed to be irreducible (AA T = v/, v > 0) and the intersection of the relative 
interiors of the domains of the two functions is assumed to be not empty (n dom(g ) D ri dom(f) ^ 0). 
ADMM minimizes the augmented Lagrangian j£f over the two variables of the problems, say x and z, first 
v with z fixed, then over z with v fixed, and then applying a proximal maximization step with respect to 
the Lagrange multiplier A. The augmented Lagrangian of index ye [0,°o] is 

z, y) = f (x) + g(z) + 1A ' T (Ax - z) + LI \Ax - z\ | 2 
7 z 7 


(29) 

























































3.2 Algorithms with time-finite convergence 


where f{x) = | \y — x\ | 2 . Denoting by proxj the proximal operator which maps a point z E 3% n to the 
unique minimizer of f(x) + 11 Ax — z\ | 2 and denoting by prox g = proxfoA the implementation detailed in 
the Appendix is obtained. 

The ADMM method rests on the proximal formulation |2l) Indeed, it can be viewed as an application 
of the Douglas-Rachford splitting algorithm Eckstein and Bertsekas ( 1992| . 


3.2 Algorithms with time-finite convergence 

All algorithms that will be reviewed in this section are active set methods resting on the LCP formulation 


15 Active set methods work by choosing a subset of indices j EJ CJ = {l,...,n} such that wj is allowed 
to be non-zero and forcing the corresponding Xj to be zero, while the remaining indices j £J\J force wj 
to be zero and allow Xj to take nonzero values. In this section we will review active set algorithm suach 
as the mixed primal-dual basis algorithm (section [3.2. 3 1 ), the critical index algorithm (section [3. 2. 4| ) and 
the Meyer’s algorithm (section 3.2.5| ). 

Before detailing the algorithms, we introduce some definitions and basic results about the geometry of 
polyhedral convex cones, on which are based the algorithms presented in this section. For further details 
the reader is referred to |Silvapulle and Sen| ( 2011[ ). 


3.2.1 Properties of polyhedral convex cones with m<n 

Lemma [T] establishes the relationship between the constraint matrix A and the edges of the polar cone 
{Yd — 1, namely A T = [y 1 ,..., y" 2 ]. We would like now to determine the edges of the constraint 

cone Jff. 

Let the vectors {y m+1 ,.., y 1 } vectors orthogonal to {/, i = 1,.., m} and orthonormal to each other so 
that the set {/, i = 1, forms a basis for & n . By defining the dual basis of the basis {/, i = 1, ..,n} as 
the set of vectors {f5\i= 1,...,^} that verify the relationship 


mw ={- 0 1 ;;j 

the constraint cone J(f = {x : Ax < 0} can be equivalently written as 

r m n 

= jx:x= + £ bifi l ,bi>0,i=l,...,mj 


(30) 


(31) 


;=i 


i=m +1 


To see that let B = [f3 1 ,..., [3 n ] and C = [y 1 ,..., Y 1 ) • Then Ax are the first m coordinates of Cx. Since 
B t C = —I n by construction, by multiplying both members at left for B~ l and at right for x, we obtain: 
Cx = —B~ l x. Therefore Cx gives the negative coordinates of v in the basis {/T, i = 1, Furthermore, 
points in JC have their first m coordinates non-negative and can be written as v = Y!l=\ biP\ where bi > 0 
for i = 1, ..,m. 

Taking into account Def. |Eq. 

JT. 
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established that the vectors j 3 l are the edges of the constrain cone 


Definition 4. Let JC be a polyhedral convex cone in C$ n , then F C Jtf is a face of JC if and only if F is 
the intersection of JF with a supporting hyperplane. 

A polyhedral convex cone arises as the intersection of a finite number of half-spaces whose defining 
hyperplanes pass through the origin. The /—th row of A is normal to the hyperplane generating the i —th 
closed half-space. 


The following Lemma, proved by Rockafellar in 1970 Rockafellar ( [1970 ), establishes a relationship 
between the first m vectors of the dual basis {/T,z = 1, and the faces of the cone JC' = JC fl 

span(JC), where span(JC) denotes the subspace spanned by the m edges of JC. 

Lemma 2. Let JC = {x : Ax < 0}, where A T = [yi,... , y m ], be the constraint cone and let {/T, i = 1,..., m} 
be the dual basis of {/', i = 1, Denoting by span(JC) the subspace spanned by the m edges of JC , 

let JC 1 — JC Dspan(JC) = {xG : v = Y!jLi bjPfbj > 0}. Then, for J C {1, the faces of Jff' 

are the sets: 


lx G :x=^ bjPfbj > o| 

j^J 
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Figure 2. The point x\ G JF belongs to the open face Fj = {x G :x = bf l ,b > 0}, with J = {1}. 
The support cone of JF at x\ is = {x : y lT x < 0} and its dual is = {x:x = cy 2 ,c > 0} 

The set of points that project onto x\ is given by the set TVf^(x\) = {x\ = {x\ +cy 2 ,c > 0}. 

The point 13 GX belongs to the open face Fj = {0}, with J = 0. The support cone of at X 3 is 
and its dual is J(f °, so that the set of points that project onto V 3 is {The point 14 EX belongs to 
the open face Fj = {x G dff \x = Y*i=\gbifi l ,bi > 0}, with J = {1,2}. The support cone of J(f at X 4 is 
the origin and its dual is the origin, so that the set of points that project onto X 4 is {^ 4 }. 


The set of all relatively open faces 

jZy = jv G £% n :x=^ bjPfbj > 0 j 
j^J 


forms a partition of Jif '. 

Denoting by u and v the projections of y onto span(JF) and span({y m+1 ,..., f 1 }) respectively, being 
the two subspaces orthogonal, y can be written as y = u + v. Since v can be easily computed as v = 
X{X T X)~ l X T y with X = [y m+1 ,..., y"], the problem ([ 7 ]) reduces to find the projection of u onto J(f'. 

The next Lemma, proved by Zarantonello Zarantonello (197]]) focuses on the set of points in span(JF) 
projecting on a given point v G say n^J, (v). Before stating it we need to define the support cone of a 
closed convex set at a given point. 


Definition 5. The support cone of a closed convex set at v denoted by JZV (x) is the smallest convex 
cone with vertex at the origin containing — v. 

Lemma 3. Let JF', ffij, {yd = 1, ...,n} and {fid = 1, ...,n} defined as in Lemma [ 2 ] If x is a point of 
W belonging to the open face Fj, then: 

• n^,(x) = x + Jf^,(x) = {x + Zi<tjCiY,Ci >0} = {'Ljejbjpj +'E i pc i Y,b i >0,c; > o), 

where x = Y,jeJ b jP 2 - 

• The sets (x) are disjoint closed convex cones. 

• =span(Jf) 

where (x) denotes the dual of the support cone of at x. 

Then any point in span(JF) projects onto an unique point in J(T' and belong to a unique non-negative 
orthant, or sector Sj 


yj= ixe& n :x = Y, h .iP J + H c jY J ' b j >0 ’ c f - 0 } 
FJ MJ 


(32) 
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Fig. [^illustrates this result for span(Jf) = & 2 . Points in Sj project onto the subspace spanned by the 
vectors {j3/y G /}, that is on the face Fj = Y^jejbjl 3/ Vectors belonging to J(T° project onto the origin, 
vectors belonging to Jfr project on themself, while each other vector of M n project onto an unique face of 

JT. 

Therefore, if the sector Sj containing the vector u is known, then the projection of u onto can be 
easily computed as projection of u onto the subspace spanned by the {/3 7 , j G /}. This reduces the problem 
of projecting y onto to the problem of finding the set of indices / such that the sector Sj contains 
u. The set complementary of /with respect to {1, corresponds to the indices of the constraints 
satisfied at equality in the optimal solution. 

The algorithms described in this section propose different strategies to find the optimal set /. 


3.2.2 Early algorithms based on the properties of polyhedral convex cones 

The first algorithm addressing the problem of projecting a point y G f$ n onto a polyhedral convex cone 
dff C 8% n by a non-asymptotic procedure dates back to work of Wilhelmsen |Wilhelmsen| ( j 1 976 | in 1976 . 

Wilhelmsen assume that the m generators of the cone drff — jv G & n : v = Y!JL\ bifi\bi > 0 j are 

known and propose an algorithm which compute a sequence of nearest points x k to y in subcones Jff k of 
dff. Each subcone Jff k is chosen so that x k G int(K k ) and is closer to y than is x k ~ l . This means that x k 
is in the near side of the supporting hyperplane of Jtf k ~ l with respect to y. The key step is to find x k+l 
given x k and the proposed procedure to do that is laborious. 

Pshenichny and Danilin ( 1978 ) [Pshenichny and Danilin ( 1978 | ) proposed an algorithm similar to the 
one of Wilhelmsen which also converges in a finite number of steps. In both algorithm m can be any 
integer even larger than n. A more efficient procedure, but with the more restrictive assumption that m<n 
has been proposed by Fraser and Massam in 1989 . 


3.2.3 Mixed primal-dual basis algorithm 

Fraser and Massam Fraser and Massamj fl989| proposed an iterative algorithm to solve the general 
problem of projecting a data point y G 8 % n onto a polyhedral convex cone dff C 8 % n generated by m < n 
linear inequality constraints. 

Polyhedral convex cones generated by a number of inequalities at least equal to the dimension of the 
space they belong to have been the subject of section |T 2. 1| As seen there, the problem of projecting 
a data point onto this class of cones can be reduced to find the set of edges , or generators of the cone, 
indexed by / C 1,..., n such that the sector Sj contains the data point. 

To this goal, the set of edges of the polar cone {f,i = 1 is completed by n — m vectors 

orthogonal to {/, i = 1,..., m} and orthonormal to each other. In the case of concave regression m = n — 2, 
the set is completed by a constant function f n+x = 1/| |1| |, where 1 is the ra-dimensional unitary vector 
and by a linear function y m + 2 = (x — xl)/11 (x — xl) \ \, where x = (x\ ,..., x m )' and x = Y!iL\ x i / m • The set 
of vectors {/, i = 1, ...,^} form a basis for 8 % n . 

Let the vectors { Pi , i = 1,..., n} be the dual basis of the basis {/, i = 1,..., n} as defined in (30). Fraser 
and Massam called the vectors (3 l and f primal and dual vectors respectively. A primal-dual basis for 
associated to the set of indices / C{l,...,^}=Lisa basis £%j — [oq,..., a n ] made up of a subset of the 
primal basis vectors and a complementary subset of the dual basis vector {Yi}ieL\j- F° r n = m 

the primal basis vectors, corresponding to the edges of J(f, are simply the columns of — Using 

the above definitions, the problem of projecting a point y G & n onto the cone can be formulated as 
follows. 


Theorem 3. The primal constrained quadratic minimization problem 0 is equivalent to the problem of 
finding 

argmin||w — v|| 2 (33) 


where u = y — v, with v = n(y| 1 s , /7a/t(y m+1 ,..., Y 1 )). Denoting by x u the solution to this problem, the 
solution to the primal problem ^ is x = x u + v. 

Finding the sector containing u is achieved moving along a fixed line joining an arbitrary chosen 
initial point x° inside the cone or on its boundary to the data point u. By moving along a fixed line, many 
sectors are crossed: each time a sector is crossed the successive approximation x k is obtained by projecting 
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3.2 Algorithms with time-finite convergence 


the point on the line passing through u and v° on the face of JC (see Lemma [5]) so that the distance 
11 u — x k 11 is decreasing in k. 

At each iteration each basis differs from another by one vector only and therefore the coordinates of 
x k onto the new basis are easy to calculate: what changes is that one coordinate becomes equal to zero and 
therefore there is not need of estimating the inverse of a matrix at each step. For this reason this algorithm 
is faster that the one of Wilhelmsen. 

Points belonging to the sector Sjk have non-negative coordinates in the mixed primal-dual basis 3$j k 
relative to the cone JC. Therefore the procedure terminates when the coordinates of the data point in the 
primal dual basis 38jk are all nonnegative, meaning that the point x k is on the face of the sector containing 
the data point u. 

The number of iterations needed is equal to the number of different sectors that the line joining the 
initial point to the data point has to cross. This number is bounded above by 2 n . It is worth to remark that 
crossing a sector corresponds to a pivot step in the equivalent LCP. In fact, the net effect of a pivot step is 
of moving from the point x k contained in the face of J(T to the point x^+i contained in the face Fjk+i 
of Jf. 

Ten years later, Meyer |Meyer| ( ] 1999] ) generalized the algorithm of Fraser and Massam to the case of 
more constraints than dimensions, that is when m> n. 


3.2.4 Critical index algorithm: Nearest point problem in simplicial cones 

Murty and Fathy (1982) [Murty and Fathi| ( |1982| ) considered the general problem of projecting a given 
vector y G 38 n onto a simplicial cone JC C 38 n . The definition of simplicial cone and pos cone , on which 
the former definition is based are as follows. 


Definition 6. The pos cone generated by the vectors in A = {5*, i = 1, denoted by pos{ A), is the 

set {x G & n \x = Y!iL\ diS\di > 0}. 

Definition 7. A cone J(f C M n is said simplicial if it can be expressed as a positive linear span of n 
linearly independent vectors A = {S\i= 1, ...,n} in 3& n (i.e., a basis for 38 n )\ J(f = pos( A). 

For any point v G pos( A) the vector d = D~ l x, where D — [5 1 ,..., 8 n ] is called combination vector 
corresponding to v. Therefore the projection x of y onto pos (A) can be expressed as a nonnegative linear 
combination of the edges of the cone: x = Y!i=i dfi 1 , where the optimal combination vector corresponding 
to v is d = D~ l x. 

Murty and Fathy named the set of indices / C {1, such that d ie j > 0 set of critical indices. 
Using the definitions of simplicial cone and combination vector, the original problem of projecting the 
point y G 3% n onto the cone dff can be formulated as follows. 

Theorem 4. The primal constrained quadratic minimization problem Q is equivalent to the problem 
d = argmin(w — Dd) T W(u — Dd) (34) 

d> 0 


where u = H{y\span(JiT)) and D = [y 1 , with {/ = Af ,i = 1, and {Y,i = m + 1, 

defined as in section \3.2.3\ 

Denoting by d the solution to this problem , the solution to the primal problem Q is x = y — Del + v. 


This formulation has the same structure of the dual formulation ([9]), where the combination vector 
d in (34) plays the same role as the dual variable A in 0. The only difference is that in ([ 9 ]) the matrix 
A G is not squared as the matrix D G & nxn . 

As shown in section |2^2| for the dual quadratic formulation ([9]), the formulation ( [34] ) can be recasted 
as a LCP. This equivalency is important since the following theorem, proved in |Murty and Fathi (1982) 
applies to the LCP formulation of ([34]). 


Theorem 5. If a single critical index for the LCP problem of order n is known, the problem can be 
reduced to a LCP of order n— 1. 


The fact that finding a critical index reduces the dimension of the problem can be argued geo¬ 
metrically. Let / be a critical index, then denoting by NPP[F;u] the subproblem (34), where T = 

{Y,i= 1, ... : n} its solution is also the solution to the NPP[T U{—y*};w]. Defining u = u— and 
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T = {y 1 ,...,y / where f = f — ^ than /,/ G {l,...,n}\Z is orthogonal to / and 

the cone /?as(ru {—/}) is the direct sum of the full line generated by y 1 and the simplicial cone pos(T). 
Solving NPP[F, u] is an n — 1 dimensional problem. If x * is the solution of the NPP(T, u), then the 

solution x to the NPP[F ; u\ is obtained as x = x* + ^ . 

By relying on Theorem [5j the authors proposed an algorithm consisting of a subroutine to identify a 
critical index for the problem, followed by a subroutine which reduces the size of the problem once a 
critical index is found. Since the solution d is the orthogonal projection onto the linear hull of {/, i G /}, 
if / is known, the solution of the equivalent LCP(q,M ) and correspondingly the solution to NPP[T ; u\ can 
be easily found. 

The routine to identify a critical index operates on the NPP[T ; u\ by exploiting the geometric properties 
of projection faces of a pos cone, whose definition is as follows. 

Definition 8. Let S C T. pos(S) is a face of pos(T) of dimension |S|. pos(S) is said to be a projection 
face of pos(T) if H(u\span(S)) G pos(S). 


In the following we enunciated some theorems on which the critical index algorithm is based. Their 
proofs can be found in |Murty| ( 1988| ) (chapter 7). 


Theorem 6. Let S C T, S Y 0. The optimum solution ofNPP[S; u] is in the relative interior of pos(T)if and 
only if the projection ofu onto the linear span ofS is in the relative interior of pos (S): II(u\span(S)) G 
ri(pos(S)). 

Theorem 7. Let x = Dd be the optimum solution of NPP[T,u\. Let J the set of critical indices and 
S = {yfj G J}. Then pos(S) is a projection face of pos (T). 


Theorem [7] tells that the projection onto the cone belongs to the pos cone generated by the set S 
of vectors corresponding to critical indices and that such pos cone is a projection face. Therefore, is the 
set of critical index is known, for Theorem [6] the solution can be computed as projection onto the linear 
subspace spanned by vectors in S. 

The routine maintains a non empty subset of T called the current set denoted by S , and a point called 
the current point denoted by x. At each stage of the routine x G pos(S). When termination occurs the 
routine either finds the nearest point in pos(T) to u in which case the problem is completely solved or it 
finds a critical index of the problem. In the latter case an LCP of order (n— 1) can be constructed and the 
same routine can be applied to this smaller problem. Hence the unique solution of the original problem 
can be obtained after at most n applications of the routine which finds a critical index. 

A characterization useful to find a critical index or the solution to the problem is provided by the 
following theorem. 


Theorem 8. Let x G pos(T) be such that 0 G T (w,x), where T(u,x) is the tangent hyperplane at x to the 
ball of center u and radius x. If there exists an index j such that (u — x) T f< 0 for all i j and (u — x) T yi 
then j is a critical index of NPP(T, u) 


A characterization of the optimal solution in terms of separating hyperplanes is given by Robertson et 
al. [Robertson et al.[|l988] ). 

Theorem 9. A point x G pos(T) is the nearest point in pos(F)to y if and only if 0 G T(y\x) and (y — 
x) T yj < 0, V j = 1, ...,n, where T(y,x) is the tangent hyperplane to pos(T) in x. 


The routine to find a critical index alternates distance reduction operations with line-search and 
projection steps to find a projection face. In practice, the routine starts by projecting on the closest edge to 
the data point. If the optimality condition is not satisfied, than the procedure iteratively adds vectors to S 
and updates the point x while consistently reduces the distance between u and x. The distance reduction 
operation is carried out efficiently by projecting onto the nonnegative hull of two vectors in & n 9 the 
current point x and a vector f satisfying one of the conditions given by the following theorem. 

Theorem 10. Given IG pos(T), x 0 such thatO G T(y,x), if for some i G {1, ...,h} we have (y—x) T y> 
0 and either: 


• | \x — y\ | < | \U(y\Y) — y\ \ and {v, / } is linearly independent, or 
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• y T Y < o 

then the projection of y onto the linear hull of {x, f} is in the relative interior of pos{x, f} 

Once such updates are not longer possibles, it employs a sequence of line-search steps and projections 
in the subspace spanned by the vectors in S to find a projection face of the corresponding pos cone. This 
line-search is in the same spirit than the one proposed by Fraser and Massam since the goal is to reduce 
the distance to the data point while keeping at the interior of a pos cone. In the particular case of concave 
regression, for which m < n, it can be implemented exactly in the same way. 

This algorithm results to be much faster than the MPDB algorithm. The primary source of its 
computational efficiency is in that it relies mostly on distance reduction operations and size reduction 
steps whose computational requirement is relatively small compared to the computational effort required 
to find a projection face through a line-search. 

Recently, Liu and Fathy (2011) [Ou and Fathi| ( [2011 1 ) generalized the work of Murty and Fathy (1982) 
to polyhedral non-simplicial cones, hence allowing the set F to contain more than n vectors. What allows 
the generalization is the equivalence between the structure of the two problems through the concept of 
polar cone. 

The authors also proposed several strategies for efficient implementation mostly based on the mathe¬ 
matical properties of the entities involved. We have incorporated, where possible, these strategies, to all 
algorithm tested for objective evaluation of performances. 


3.2.5 Meyer’s algorithm 

Meyer Meyer (pear]) considered the general problem of projecting a data point y G & n onto a polyhedral 
convex cone C 8% n generated by a finite number m of linear inequalities. The problem is reduced to 
find the set of indices / C {1, = L, where M <m is the number of linearly independent constraints, 

corresponding to not saturated constraints at the solution. Meyer called these indices hinges. 

When m<n , the algorithm can be applied to both the primal and the dual formulation, whereas for 
m > n it is applied to the dual formulation. In the following, since for the problem of concave regression 
m < n we consider how to solve the primal problem 0- 

The search for / is performed iteratively, starting with an initial guess by removing or adding one 
index at time, until the optimal solution is obtained. For each candidate J k two conditions are tested: the 
interior point condition and optimality condition. 

The interior point condition is satisfied when the current iterate belongs to the interior of pos(S ), that 
is when x k G pos(S ), where S C {/T, i G J k } . By using the following theorem 


Theorem 11. Let S C {/3 \i = l...,m}, S 0. The optimum solution of problem (]t|) is in the relative 
interior ofpos(S) if and only ifH(y\span(S )) is in the relative interior ofpos(S). 


x k can be computed as projection of y onto the linear hull spanned by the vectors {j 3\i eJ k }. If the 
feasibility condition is not satisfied, the index j G L\J k corresponding to the most negative coefficient is 
added to J k and the interior point condition is checked again. 

Once the feasibility condition is satisfied, the optimality condition is tested by using the characteriza¬ 
tion given in Theorem]?] If it is not satisfied, the vector (3\i GJ k which most violates the condition is 
removed. The procedure continues until both conditions are satisfied. 

Convergence is guaranteed by the fact that when the algorithm replaces just one edge, the Sum of the 
Squared Errors (SSE) after is less than the SSE before so that the algorithm never produces the same set 
of edges twice, which would result in an infinite loop. 

In practice, each time an hinge is added, the best solution with n+ 1 hinges where the first n hinges 
are already given is obtained. But this is not in general the best fit with n + 1 hinges, so that some hinge 
may need to be changed. Therefore, the optimal solution can be interpreted as the best approximation 
with the biggest possible number of hinges. 


4 ISSUES ABOUT EFFECTIVNESS FOR LARGE-SCALE PROBLEMS 

In this section, we discuss stenghts and limitations of the algorithms detailed above in solving large-scale 
instances of a particular kind of cone regression, the concave regression. In particular, we consider 
computational issues related to numerical stability, computational cost and memory load as well as the 
suitability to take advantage of available good estimates and to be implemented in an online fashion. 
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4.1 Suitability to take advantage of available good estimates 


4.1 Suitability to take advantage of available good estimates 

One general strategy for reducing the computational cost of a large-scale optimization problem is to use 
an initial guess, easier to calculate and close to the optimal solution. 

Within the class of algorithms with asymptotic convergence, splitting-based methods work by activing 
each of the convex constraints repetitively and by combining them to obtain a sequence converging to a 
feasible point. Since the projection point is characterized by the variational inequality 

x = Tl(y\J^) e Vx G : (y — x,x — x) < 0 (35) 


the projection operator is a closed contraction. Therefore the set of fixed points of is 

exactly . This prevents the use of an initialization point belonging to the feasible set as well as the use 
of multiscale strategies since there is not guarantee that the solution from a previous level does not belong 
to the feasible set. 

The same difficult arises when considering interior point algorithms, since them need to be initialized 
to an interior point. In Goldman and Ruud| ( |1993 ), Goldman proved that the Dykstra’s algorithm can 
potentially starts from better starting values than the given data point y. The author established the 
convergence to the nearest point to the primal cone from an arbitrary point in the intersection of ^ and 
the ball of radius ||y||, where ^ = {x\x = y — A r A, A > 0}, is a rotation of 7T radiants of the dual cone 
with its vertex translated at y. A point satisfying these conditions can be obtained efficiently by using 
distance reduction operations based on Theorem [T0| It is worth to remark that this result can be easily 
interpreted in the active set framework. In fact, the Dykstra’s algorithm can be undestood as a primal 
active set method and its solution is a primal dual feasible point. Therefore, any dual feasible point, that is 
every point belonging to the set can be used as initialization. 

All algorithm with time finite convergence detailed in the previous section are primal-dual and they 
reduce the problem of projecting a given data point onto a convex set to the problem of finding the set of 
indices corresponding to not saturated constraints at the solution. In general, they involve the selection of 
a subset from a collection of items, say /C {1, With this formulation, they potentially allow to 

take advantage of a good estimate of the optimal active set. However, the adaptation is not rapid since the 
active set estimate is updated of one-by-one changes preventing this class of method from being effective 
general-purpose solvers for large-scale problems. 

In the algorithm of Fraser and Massam, the set of successive approximations are obtained by moving 
along a fixed line connecting the initial guess to the data point. The number of iterations needed to reduce 
the data point reduces the number of sectors to be crossed to join the sector containing the data point. 

By contrast, in the algorithm of Meyer the proximity of the initial guess to the data point is not a good 
criterion selection for the initial guess. In fact, given an initial guess, the solution is attained by adding 
and/or removing one index at time until optimal solution is found. Taking into account that the optimal 
solution can be interpreted as the best approximation with the biggest possible number of hinges, if the 
optimal solution contains just a few hinges, than using the empty set as an initial guess would result much 
faster than using the full set of possible hinges. On the contrary, if just a few constraints are satisfied at 
equality in the optimal solution, than the full set of indices will be a much better initial guess than the 
empty set. Therefore even if the choice of the initial guess may highly influence the performances, its 
choice depends on the data and there is not a well established criterion to fix it. 

The Murty and Fathy’s algorithm reduces the size of the problem each time a critical index is found. 
Therefore it is not compatible with strategies that take advantage of a good initial estimate since a good 
estimate does not lead to find a critical index faster. 

To overcome the limitation of active set methods in taking advantage of a good estimate, Curtis et al. 
|Curtis et ak| ( |2012| ) have recently proposed an euristic framework that allows for multiple simultaneous 
changes in the active-set estimate, which often leads to a rapid identification of the optimal set. However, 
there is not guarantee of the computational advantages for general problems and, furthermore, the authors 
recommend their approach when solving generic quadratic programming problems with many degrees of 
freedom, that is not the case of general concave regression problems. 


4.1.1 PAV’s inspired approximate solution 

To evaluate through numerical simulations the suitability to take advantage from good initial estimates, 
we propose an algorithm inspired to Pool Adjacent Violators (PAV), whose computational complexity 
is <^(n). Starting from the original signal, violated constraints are removed one by one by projecting 
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4.2 Suitability to be implemented in an online fashion 


the current iterate onto the convex cone corresponding to the violated constraint until a primal feasible 
solution is obtained. Since the dual feasibility of each iterate is not guaranteed, the founded solution is not 
optimal. However, in our experience the solution is a very good approximation of the optimal solution. 

4.2 Suitability to be implemented in an online fashion 

Another strategy to deal with large projection problems would be to build and evaluate the solution 
incrementally according to the order of its input, as done in online methodologies developed for dynamic 
optimization |Bhatia and Biegler] \\996) over a stream of input. Even if the input is given in advance, 
inputs are processed sequentially and the algorithm must respond in real-time with no knowledge of future 
input. Each new input may cause to rising or falling constraints and the final solution is a sequence of 
feasible solutions, one for each time step, such that later solutions build on earlier solutions incrementally. 

Of course this strategy requires that the algorithm respond in real-time for each new input, that would 
not possible when dealing with large matrix inverse computations. Let x G f$ n be the projection of y G 
onto and let x G be the projection of y G & n+l onto J(f n+l . When a new element is added to 

the data point, a new constraint is added too, so that the constraint cone has a new edge. If this constraint 
corresponds to a critical index, that is to a constraint satisfied at equality in the optimal solution, then 
the projection face will be the same so that no further computing will be needed. On the contrary, if the 
new constraint does not correspond to a critical index, the projection face will change, including the edge 
corresponding to the new constraint and removing and adding some others. Therefore, the major difficulty 
faced by online strategy is the same faced in exploiting good estimates. 


4.3 Computational issues 

As highlithed in the previous section, despite the different strategies implemented by algorithms with 
time-finite convergence, generally an index at time is iteratively added or removed from the current set 
of indices J k until both the feasibility condition and the optimality condition are satisfied. Checking the 
optimality condition involves computing the inverse of a matrix that differs slightly from the matrix of the 
previous iteration. What ’’slightly” exactly means depends on the specific algorithm and it is detailed in 
the following. 

The algorithm of Fraser and Massam involves the computation of a n x n fixed size matrix inverse 
at each iteration. The matrix to be inverted differs from the matrix of the previous iteration only for 
one column, being the change of the form A —x (A + mxv), where u is an unitary vector with only one 
nonzero component corresponding to the index of the column to be changed, and v corresponds to the 
difference between the elements of the dual basis vector and the elements of the primal basis vector or 
viceversa. In this case the inverse can be updated efficiently by using the Sherman-Morrison formula: 
(A + uxv)~' = fff , where z = A 1 u,w=(A 1 ) r v, A = v T z- Therefore only two matrix multiplications 
and a scalar product are needed to compute the new inverse at each step. 

The algorithm of Meyer, as well as the one of Liu and Fathi Liu and Fathi (2011) involve the 
computation of an inverse matrix of variable size at each iteration. The matrix to be inverted differs from 
the matrix of the previous iteration only for one column, which has been added or removed. Since the 
matrix to be inverted A(J) G & rxn , with r < m is generally rectangular, the Moore-Penrose generalized 
inverse or pseudoinverse is computed:A(/)^ = A(/) r (A(/)A(/) r ) _1 . 

In Matlab and Scilab, the computation of the pseudoinverse is based on the Singular Value Decom¬ 
position (SVD) and singular values lower than a tolerance are treated as zero. The advantage of this 
approach is that the pseudoinverse can be computed also for a nonsingular matrix. However, the method 
proposed by Liu and Fathi Liu and Fathi ( 2011| ) to improve the computational efficiency of their algorithm 
does not take advantage of SVD approach since it consists in updating the matrix (A(/)A(/) r ) -1 . If the 
matrix A (/) is ill-conditioned, then the inverse cannot be computed with good accuracy and for the matrix 
A(/)A r (/) is even more so because this operation squares the condition number of the matrix A (/). 

A better solution would be to update directly the pseudoinverse. This can be achieved when a 
column is added by using the method proposed in Andelic et al. (2006) Let A T G & nxr , x G 3% n and 


B= (A 7 
follows. 

B 1 = 


x) G< 


pix(r+1) 


The pseudoinverse of B can be computed from the pseudoinverse of A 1 as 


A^ — Atxw^\ 


w’ 


) 


, where w = (I — AA^)x and wA = 


Alternatively, the transformation A(J) T (A(J)A(J) T ) 1 y can be efficiently computed by a QR de- 
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composition approach. Let A 7 = (QnQn) ( }} ) be the QR decomposition of A r , where R\\ is an 


*11 
0 / 

mxm invertible upper triangular matrix. Then: A(/) r (A(/)A(/) r ) _1 = Qn(R 7 1) . The inverse of an 

upper triangular matrix can be efficiently implemented by a left matrix division or by more sophisticated 
methods as the one proposed in |Mahfoudhi| ( [2012| . 

Courrieu Courrieu|( |2QQ5| proposed a method based on the full rank Cholesky decomposition which 
has a computation time substantially shorter of the method based on SVD decomposition. The two main 
operations on which his method is based are the full rank Cholesky factorization of A r A and the inversion 
of L t L, where L is a full rank matrix . On a serial processor these computations are of complexity order 
n 3 ) and <^(m 3 ) respectively. However, in a parallel architecture, with as many processor as necessary, 
the time complexity for Cholesky factorization of A 7 A could reduce to while the time complexity 
for the inversion of the symmetric positive definite matrix L r L could reduce to @(log(r)). However, the 
computational advantage of this method can be appreciated only when r«n, since the inverse of a r x r 
matrix has to be computed, which is not in general the case, specially for concave regression problems. 

The method proposed by Zhu and Li Zhu and Li ( 2007| ) for recursive constrained least squares 
problems, found that the exact solution of Linear Equality-constrained Least Squares can be obtained 
by the same recursion as for the unconstrained problem, provided that the Rescricted Least Squares 
procedure is appropriately initialized. However, even this approach offer significant advantages in term of 
computational cost only when the number of constraints m is small, which is not the case for large-scale 
concave regression problems. 


5 IMPROVING THE ACTIVE SET FRAMEWORK FOR CONCAVE REGRES- 
SION PROBLEMS 

An active set algorithm for solving the concave regression problem generates a sequence of quasi 
stationary points. A primal (dual) feasible active set algorithm generates a sequence of primal (dual) 
feasible quasi stationary points with decreasing objective function values and terminates when the dual 
(primal) feasibility is satisfied. An active set J induces a unique partition of the indices {1, into 
blocks. A block B of such partition is a set of consecutive integers, say, {p,p + 1, such that the 

index i of the constraint Xi+i —Xi < jc/+2 — x i+\ is in J for each i such that p < i < q — 2. Conversly any 

such partition of indices determines a unique active set. Denoting by A* the multiplier associated with the 

ith constraint, the Karush-Kuhn-Tucker can be written as: 

Xi+\-Xi<Xi+2-Xi+\ i=l,...,n-2 

yi-xi = Ai 
yi —xi -2Ai + A2 
y 3 -*3 = X\ - 2 A 2 + A 3 

< yn— 1 X n —1 = Arc—4 2X n —3 H - Aft _2 

y yi — 1 %n —1 = Aft—3 2Aft_ 2 

yn x n = Aft—2 

A, > 0 i = 1,..., n — 2 

Aj(2x,-+i —Xi—Xi+ 2 ) — 0 / = 1,..., n — 2 

It is easy to show that: Y!i=\ Kyi ~ x i) = 0 and YJ}=\ (yi ~ x i) = 0- Knowing that each block can be 
represented by an affine function Xi = a + J 3i, the case of blocks the above systems can be written as: 

L'Ln l +iyi = u 2 L'Ln 1 +i( i - , n)+P 2 (n-ni)-X ni 

< L?=n l+ iyi = U 2 L?=n 1 + l(i - m) + m) - knl 

!"=„,+1 (* - ni)yi = a 2 ILm+1 (*' - «1 ) 2 + p 2 Ei=m+i n (* ~n\)~ K\ 

< X l n\ + jS 1 — a 2 — jS 2 = 0 

Therefore, for each block the unknown variables to be computed are a, j3, A. The systems to be solved 
can be written as Ax = b, where x=(a l ,P\a 2 ,p 2 ,X ni ),b = yi ,'£"=„ 1 +1 yu E"= ni +t E"= ni +t (*' ~ 

«/)>’/,0) and 
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n\ 
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0 
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x^n—n i • 
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I lUi 2 

L-V 

0 

0 

n\ 

0 

0 

V n 2~n\ -2 
Li= 1 1 

V n 2~n\ • 
Li= 1 1 

0 

\n\ + 1 

l 

-1 

-1 

°/ 


In general, if k is the number of blocks, than the system to be solved has size 3k —1. As observed 


Kuosmanen (2008), in the case of concave regression the number of blocks at the solution is usually 


much lower than n. Therefore, an active set algorithm that start with an empty active set, should found the 
solution without the need of inverting large matrices. 


6 COMPARATIVE RESULTS 

In Tab. [I] we compare qualitatively the algorithms analyzed above in terms of their formulation (primal, 
dual, or primal-dual), their possibility to be initialized and their major limitations in dealing with large 
scale problems. All analyzed methods are dual or primal-dual: dual methods cannot be initialized, 
whereas initialization in primal-dual methods is allowed but constrained. The major limitation of 
asymptotic convergence methods when dealing with large problems is the convergence, whereas time- 
finite convergence methods become too slow and numerical instable because of accumulation errors. In 
the following, we provide evidence of these limitations through numerical simulations reported in the 
Appendix and we compare quantitatively their performances in term of distance from the solution for 
one second, one minute and ten minutes of CPU time. Instead of using only unstructured random data 
as data-test, we considered signals of increasing difficulty level varying from a noised concave signal 
to a concave/convex noised signal. More precisely, we considered three signals, whose equations and 
plots are given in Fig. [3] of the Appendix and, for each of them, we considered three different sizes: 
n G {50,500,1000} and three increasing levels of noise (standard deviations a G {0.01,0.1,0.5}). To 
evaluate robustness against initialization variation for time-finite active set methods, we considerer three 
different initializations of the active set J\ the empty set, the set of m indexes corresponding to the linear 
inequality constraints, the set of not saturated constraints obtained by using the algorithm inspired to 
PAV described in the previous section. In the class of algorithms with asymptotic convergence, the most 
efficient results to be ADMM. This is evident even for very small size data, when using difficult signal 
such as near-convex signals or very noised signals (see Fig. [5]and Fig. [6]of the Appendix). For noised 
concave signals (Fig. [4] of the Appendix), the computational efficiency of ADMM is more evident in 
presence of an high level of noise. Already for signals of size 500 the performance of ADMM are not very 
good: convergence (SSE < 10 -6 ) is not completely attained. The algorithm of Meyer is very sensitive to 
the initialization. It gives good performances when the signal is a Gaussian white noise and the initial 
active set is empty since most of constraints are expected to be saturated at the solution (see Fig. [6] and 
Fig. [9] of the Appendix). Given a good initialization, the MPDB algorithm allows to compute the exact 
solution faster than other methods. However, this algorithm is not numerically stable since it require to 
compute the inverse of a matrix at each iteration. As explained in section [43] this is than incrementally 
by using the Shermann-Morrison formula. However, numerical round-off errors cumulate so that, in 
our implementation, the exact inverse is computed each 150 iterations. As it can be observed in Fig. 


error dominates the calculation and the distance to the solution increases instead of decreasing. These 
results demonstrated that, although the theoretical and practical advances in recent years, the use of 
shape-constrained nonparametric techniques for solving large scale problems (more than many thousand 
of points) is still limited by computational issues. To deal with very large scale problem, up to a million 
of points, matrix inverse calculation should be avoided and more efforts should be devoted to find a way 
to better initialized primal-dual methods, by computing approximate solutions and exploiting multi-scale 
strategies. 


10 , Fig. [IT] and Fig. 12 of the Appendix, that refer to signals of size 1000, sometimes the round-off 


7 CONCLUSIONS 

In this paper we have stated, analyzed and compared qualitatively and quantitatively several optimization 
approaches for solving the cone regression problem. We have distinguished two broad classes of methods. 
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Table 1 . Qualitative comparison of all algorithms 


Algorithm 

Type 

Formulation Initialization 

Limitation 

Hildret 

dual 

9 

not possible 

convergence 

Dykstra 

primal-dual 

7 

constrained 

convergence 

ADMM 

dual 

is 

1 not possible 

convergence 

LSPS 

primal 

25 

constrained 

convergence 

Uzawa 

dual 

27 

not possible 

convergence 

MPDB 

primal-dual 

33 

' constrained 

slow or instable 

Meyer 

primal-dual 

3 

constrained and not robust 

slow or instable 

Active Index 

dual 

34 

not possible 

slow or instable 


On one side, methods with asymptotic convergence that rest on the properties of proximity operators and 
act by finding the solution as the limit of a sequence of successive approximations. On the other side, 
methods with finite-time convergence that exploit the geometric properties of polyhedral convex cones 
and find the exact solution as non-negative linear combination of functions, forming a basis in a specified 
finite dimensional space. Simulations up to one thousand of points have demonstrated that the choice of 
the optimization approach severely impact algorithmic performances. In particular, it has emerged that 
methods with time-finite convergence are much more efficient with respect to asymptotic-convergence 
methods. However, from this study it emerged that they face a twofold difficulty to cope with large-scale 
optimization: the first difficulty arises from the fact that all algorithm of this class modify the active set 
estimate one-by-one, making the adaptation of the initial active set estimation very slow; the second 
difficulty lies in the fact they involve the computation of the inverse or the pseudoinverse of a matrix at 
each variation of the active set. Although there exists many methods to do that efficiently when the matrix 
rank is much lower that the size of the data, this condition cannot be assured in general. Incremental 
strategies to reduce the cost of computing the inverse of a matrix when the inverse of a slightly different 
matrix is known, are bounded by round-off error accumulation in an iterative setting. The results of this 
study suggest that to be able to trait very large-scale problems (up to a million of points) further research 
should focus on finding a way to exploit classical multi-scale strategies and to compute an approximate 
solution through a penalization or splitting method without involving any matrix inverse calculation. 
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8 APPENDIX* 


8.0.1 Signals used for comparative evaluations 

( 2nsin(^z) z=l,...,f 

« + z=| + l,...,y where/3 = jy a = 2«««(|) —/3n, 7 = and 5 = a + /3y— 

[ 7z 3 + <5 Z= y+ 1 


s 2 (x) = ^(p, o 2 ) 

Ss(z) = sinc( — 1) 










Figure 3. Signals Si (top left) and S 2 (top rigth) to which has been added white noise with three 
different values of standard deviation <7 = 0.01, <7 = 0.1, and <7 = 0.5. 


20 . 


31 










































8.0.2 Comparative evaluations on signals of size 50 


distance to solution(L2) 


► murtyFathi 
0 meyer_full 
7 - V“ V meyer_empty 
X— X meyer_PAV 
£r— & mpdb_PAV 







Figure 4. Distance to the solution (L 2 norm) for a signal of type 51 of size 50. From up to the bottom, 
three increasing level of noise have been added. (Left) Algorithms with time-finite convergence: all them 
converge istantanely. (Right) Algorithms with asymptotic convergence: ADMM is more robust to noise. 
LSPS and Dykstra use a parametrization in the primal parameter space which causes numerical round-off 
errors cumulate so that the fitted values does not satisfy the constraints of the dual problem and 
convergence is not attained. 
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distance to solution(L2) 


► murtyFathi 
0 meyer_full 
7-V-V meyer_empty 
X- X— X meyer_PAV 
& — ft — & mpdb_PAV 


time 







distance to solution(L2) 


♦ ♦ murtyFathi 
B—0— 0 meyerjull 
7 - V — V meyer_empty 
X- X—X meyer_PAV 
si? — mpdb_PAV 




\ 


-W- 




Figure 5. Distance to the solution (L 2 norm) for a signal of type S 2 of size 50. From up to the bottom, 
three increasing level of noise have been added. (Left) Algorithms with time-finite convergence: all them 
converge istantanely. (Right) Algorithms with asymptotic convergence: both ADMM and Hildret attain 
the solution when not noise is added but ADMM is much more robust to noise. 
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distance to solution(L2) 


♦ ♦ murtyFathi 
£ — © — © meyerjull 
7-V - V meyer_empty 
'X X— X meyer_PAV 
mpdb PAV 


lo* io 2 io 2 

time 


distance to solution(L2) 


♦ "♦ murtyFathi 
© ® — © meyerjull 
7 - V — V meyer_empty 
X- X — X meyer_PAV 
mpdb PAV 



10 io io io 


time 



distance to solution(L2) 

10 l " 





Figure 6. Distance to the solution (L 2 norm) for Gaussian white noise signals of size 50. (Left) 
Algorithms with time-finite convergence: all them converge istantanely. (Right) Algorithms with 
asymptotic convergence: ADMM is the only algorithm that converges. 
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8.0.3 Comparative evaluations on signals of size 500 




distance to solution(L2) 

10 1 1 



time 


distance to solution(L2) 



distance to solution(L2) 



distance to solution(L2) 



Figure 7. Distance to the solution (L 2 norm) for a signal of type S 1 of size 500. From up to the bottom, 
three increasing level of noise have been added. (Left) Algorithms with time-finite convergence: Meyer’s 
algorithm is very sensitive to the initialization. The best performance are achieved by MPDB with a PAV 
approximate initialization. (Right) Algorithms with asymptotic convergence: ADMM converges only for 
when the signal is slighly noised. 
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distance to solution(L2) 




time 



distance to solution(L2) 



distance to solution(L2) 



distance to solution(L2) 



Figure 8. Distance to the solution (L 2 norm) for a signal of type S 2 of size 500. From up to the bottom, 
three increasing level of noise have been added. (Left) Algorithms with time-finite convergence: the best 
performance are achieved by MPDB with a PAV’s inspired initialization. (Right) Algorithms with 
asymptotic convergence: no algorithm converges. 























































distance to solution(L2) 






Figure 9. Distance to the solution (L 2 norm) for a Gaussian white noise signal of size 500. (Left) 
Algorithms with time-finite convergence: the best performance are achieved by MPDB with a PAV’s 
inspired initialization. (Right) Algorithms with asymptotic convergence: no algorithm converges. 
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8.0.4 Comparative evaluations on signals of size 1000 








Figure 10. Distance to the solution (L 2 norm) for a signal of type S 1 of size 1000. From up to the 
bottom, three increasing level of noise have been added. (Left) Algorithms with time-finite convergence: 
the best performance are achieved by MPDB with a PAV’s inspired initialization. (Right) Algorithms 
with asymptotic convergence: no algorithm converges. 
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distance to solution(L2) 





Figure 11 . Distance to the solution (L 2 norm) for a signal of type S 2 of size 1000. From up to the 
bottom, three increasing level of noise have been added. (Left) Algorithms with time-finite convergence: 
the best performance are achieved by MPDB with a PAV’s inspired initialization. (Right) Algorithms 
with asymptotic convergence: no algorithm converges. 
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Figure 12. Distance to the solution (L 2 norm) for a signal of type S 3 of size 1000. (Left) Algorithms 
with time-finite convergence: the best performance are achieved by MPDB with a PAV’s inspired 
initialization when the round-off error does not dominate the calculation and by the Meyer algorithm 
whose initial active set is empy. This is easy to understand since, for a pure noise signal an high degree of 
freedom is expected at the solution and therefore the empty active set is close to the the final active set. 
(Right) Algorithms with asymptotic convergence: no algorithm converges. 
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Supplementary Materials 

The online supplementary materials contain the pseudocode of the reviewed algorithms as well as 

their implementation in Scilab. 
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